In this notebook an integrated analysis is preformed. Pair interaction data of miR - target(mRNA) and DNAm - target(mRNA) is an output from notebook 02-Regulation_targets. Integrative analysis is goint go be focused on mRNA so a table with significantly DE and one of its regulator. Since there are some genes that have multiple regulators mir’s with the best correlation will be selected as a regulator. For DNAm !!! get back to this To explore the influence of miR and DNAm on mRNA expression we will preform GSEA on groups of genes that are under regulation of DNAm, mir or both. Next, we will make an unsupervised clustering on genes with FC features to find patterns in gene expressions. (OVO ISTO POBOLJŠATI)
library(venn)
library(dplyr)
Attaching package: ‘dplyr’
The following object is masked from ‘package:Biobase’:
combine
The following object is masked from ‘package:matrixStats’:
count
The following objects are masked from ‘package:GenomicRanges’:
intersect, setdiff, union
The following object is masked from ‘package:GenomeInfoDb’:
intersect
The following objects are masked from ‘package:IRanges’:
collapse, desc, intersect, setdiff, slice,
union
The following objects are masked from ‘package:S4Vectors’:
first, intersect, rename, setdiff,
setequal, union
The following objects are masked from ‘package:BiocGenerics’:
combine, intersect, setdiff, union
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(GenomicRanges)
library(gprofiler2)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
library(DESeq2)
library(tibble)
library(kohonen)
library(rprojroot)
# source(file = file.path(getwd(), "scripts", "idMap.R"))
# source(file = file.path(getwd(), "scripts", "integrateOmicsData.R"))
# source(file = file.path(getwd(), "scripts", "kmeansHeatmap.R"))
genes_dir <- file.path(find_rstudio_root_file(), "Scripts", "genes.hg19.ensembl.rds")
dir_dest1 <- "/home/katarina/MyProjects/HNSCC/01-Exploratory_analysis_and_differential_expression"
dir_dest2 <- "/home/katarina/MyProjects/HNSCC/02-Regulation_targets"
## 01
# mRNA
mRNA_res <- readRDS(file.path(dir_dest1,"outputs/RDS/mRNA_res.rds"))
# miR
mir_res <- readRDS(file.path(dir_dest1,"outputs/RDS/mir_res.rds"))
# dnam
dnam_res_cgi <- readRDS(file.path(dir_dest1,"outputs/RDS/dnam_res_cgi.rds"))
## 02
# miR-mRNA
targ_mir_mRNA <- readRDS(file.path(dir_dest2,"outputs/RDS/targ_mir_mRNA.rds"))
cor_mir <- readRDS(file.path(dir_dest2,"outputs/RDS/cor_mir.rds"))
# dnam-mRNA
targ_dnam_mRNA <- readRDS(file.path(dir_dest2,"outputs/RDS/targ_dnam_mRNA.rds"))
In this chapter we build the integrated table that will contain multiple information from different data sets.
Firstly, filtering mRNA results to get only significantly DE genes.
mRNA_de <- filter(as.data.frame(mRNA_res),padj < 0.05)
simple_tb <-
data.frame(genes=rownames(mRNA_de)) %>%
mutate(mir=dplyr::case_when(
is.element(genes,targ_mir_mRNA$target_ensembl) ~ 1, TRUE ~ 0)) %>%
mutate(dnam=dplyr::case_when(
is.element(genes, targ_dnam_mRNA$ensembl) ~ 1, TRUE ~ 0))
# table of genes under regulation
table(simple_tb [,c("mir","dnam")])
dnam
mir 0 1
0 1129 776
1 356 732
venn::venn(x =list(simple_tb[simple_tb$dnam ==1,"genes"],
simple_tb[simple_tb$mir ==1,"genes"]),
zcolor = "style",
snames = c("DNA methylated","microRNA"))
To explore the influence of different types of regulation we preform gsea on crude groups of genes where the genes are presneted in the venn. From the gostplot there are not many terms associated with genes that are ONLY under DNA methylation regulation (Only muscle contraction -REAC). In the intersection part there are terms that are involved in cell cycle regulation, regulation of transcription, .. Genes under mir regulation have many terms involved with the immune system, also some terms involved with cell mobility (cell-cell adhesion, cell migration, collagen degradation). The rest that are not under dnam or mir regulation have terms involved in muscle contractions and some immunological pathways. TF database added later, comment: In all groups there are many hits with TF.
# Building multiple queries
query_ls <-
list(dnam = filter(simple_tb, mir == 0 & dnam == 1 )$genes,
mir = filter(simple_tb, mir == 1 & dnam == 0 )$genes,
intersection = filter(simple_tb, mir == 1 & dnam == 1 )$genes,
rest = filter(simple_tb, mir == 0 & dnam == 0 )$genes)
# Sources
GSEA_SOURCES <- c("GO:BP","CORUM","KEGG","REAC","WP","MIRNA", "TF")
gost <-
gost(query = query_ls,
organism= "hsapiens",
exclude_iea=TRUE,
domain_scope="annotated",
ordered_query=F,
sources=GSEA_SOURCES)
# Plot and table of significant results
gostplot(gost, capped = T, interactive = T)
Warning message:
`gather_()` was deprecated in tidyr 1.2.0.
Please use `gather()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Building a complete and comprehensive table with results of DE and DM of the transcriptomic (mRNA), small-RNA transcriptomic (mir) and DNA methylation (CpG islands) to have all significant data in one data frame. The comprehensive table will contain all mir mRNA and dnam mRNA pairs. ! add colnames!
# this steps for this are described in the script
integ_df <-
integrateOmicsData(mrna_res_obj = mRNA_res,
mir_res_obj = mir_res,
dnam_res_obj = dnam_res_cgi,
mir_targets_cor = cor_mir,
genes_file = genes_dir)
head(integ_df)
message(paste0("Dimensions of integrated tbl: ", dim(integ_df)[1]))
Dimensions of integrated tbl: 4607
# Summary statistics
summary(integ_df)
gene symbol gene_mean
Length:4607 Length:4607 Min. : 2.07
Class :character Class :character 1st Qu.: 50.97
Mode :character Mode :character Median : 263.10
Mean : 1262.56
3rd Qu.: 876.42
Max. :174035.04
gene_FC gene_pval gene_pajd
Min. :-7.3216 Min. :0.000e+00 Min. :0.000000
1st Qu.:-2.6883 1st Qu.:9.355e-05 1st Qu.:0.002654
Median :-1.3979 Median :9.092e-04 Median :0.012951
Mean :-0.4427 Mean :1.776e-03 Mean :0.016942
3rd Qu.: 1.8978 3rd Qu.:3.027e-03 3rd Qu.:0.028972
Max. : 6.9477 Max. :6.988e-03 Max. :0.049982
mir mir_r mir_r_pval
Length:4607 Min. :-0.9505 Min. :0.0000
Class :character 1st Qu.:-0.4725 1st Qu.:0.0269
Mode :character Median : 0.0385 Median :0.1098
Mean : 0.0021 Mean :0.2098
3rd Qu.: 0.4560 3rd Qu.:0.3064
Max. : 0.9341 Max. :1.0000
NA's :1929 NA's :1929
mir_mean mir_FC mir_pval
Min. : 2.77 Min. :-4.1993 Min. :0.0000
1st Qu.: 95.31 1st Qu.:-1.1996 1st Qu.:0.0001
Median : 392.78 Median : 1.3351 Median :0.0005
Mean : 3165.96 Mean : 0.8145 Mean :0.0030
3rd Qu.: 1232.71 3rd Qu.: 2.4050 3rd Qu.:0.0061
Max. :75885.30 Max. : 5.2725 Max. :0.0100
NA's :1929 NA's :1929 NA's :1929
mir_padj dnam_quot dnam_pval
Min. :0.0000 Min. :-2.6695 Min. :0.0000
1st Qu.:0.0022 1st Qu.:-0.3642 1st Qu.:0.0103
Median :0.0053 Median :-0.1369 Median :0.0376
Mean :0.0156 Mean :-0.0210 Mean :0.0727
3rd Qu.:0.0314 3rd Qu.: 0.1289 3rd Qu.:0.0857
Max. :0.0448 Max. : 3.9967 Max. :0.9481
NA's :1929 NA's :2342 NA's :2342
dnam_padj dnam_num_sites dnam_condensed_ranges
Min. :0.0000 Min. : 1.000 Length:4607
1st Qu.:0.0299 1st Qu.: 5.000 Class :character
Median :0.0693 Median : 7.000 Mode :character
Mean :0.0956 Mean : 7.941
3rd Qu.:0.1171 3rd Qu.:10.000
Max. :0.9492 Max. :44.000
NA's :2342 NA's :2342
Here we want to make a table that has only one row per mRNA so that we can explore the influence of DE mir and dnam that are putative regulators of said mRNA. For multiple mir regulators we choose the best correlating one. For multiple dm cgi’s involved in regulating an mRNA we choose the one with the “best” (larger absolute value) fold change.
reg_df <-
integ_df %>%
# Group by gene to compare their regulators
group_by(gene) %>%
# Storing the best correlating mir into best_cor column
# NA is for those that are not under mir regulation
mutate(best_cor=ifelse(!is.na(mir_r), mir[which.max(abs(mir_r))], NA)) %>%
# filtering out mirs that are not the best putative regulators
dplyr::filter(is.na(mir) | mir == best_cor) %>%
# Storing the cgi with max abs value in best_cgi
mutate(best_cgi= ifelse(!is.na(dnam_quot), dnam_condensed_ranges[which.max(abs(dnam_quot))],
NA)) %>%
# Filtering out cgi that are not the best
dplyr::filter(is.na(dnam_condensed_ranges) |
dnam_condensed_ranges == best_cgi) %>%
# Dropping uneeded columns. This info is already in mir and dnam ranges
dplyr::select(-best_cor, -best_cgi) %>%
ungroup()
reg_df
Using two unsupervised clustering methods SOM and kmeans to explore clusters of genes under different combination of regulations. Features are fold changes of mRNA and mir data and quotient of beta intensity value for dna methylation. We hypothesized that groups with mRNA FC values will be opposite of mir and dnam regulators which would suggest that those groups are under regulation of mir and/or dnam.
K means clusters mRNA genes into k number of clusters in which each mRNA belongs to the cluster with the nearest mean (or cluster centroid). For more details look at the lecture materials from pmf machine learning.
# Prepping kmeans table
k_tb <- data.frame(row.names = reg_df$gene,
dplyr::select(reg_df,
gene_FC,
mir_FC,
dnam_quot))
hist(k_tb$dnam_quot)
# Scaling (standardization - x-mean/sd)
# k_tb <- scale(k_tb) %>%
# as.data.frame()
#k_tb$dnam_quot <- scale(k_tb$dnam_quot)
# Doubling values so that the colors are more intense
k_tb$dnam_quot <- k_tb$dnam_quot*2
hist(k_tb$dnam_quot)
# All NA values are 0
k_tb[is.na(k_tb)] <- 0
kmeansHeatmap(k_tb, clusters= 10:14, nstart = 60)
Testing the plotting function
sessionInfo()